
Bloom Filter for Web Scraping Deduplication: Principle, Python, and Redis

**Bloom Filters Explained: Space-Efficient Deduplication for Big Data** Discover how Bloom Filters use bit arrays and hash functions to run existence checks on massive datasets with minimal memory (about 122 KB per 1M items). Learn Python and Redis implementations for web crawling, cache systems, and distributed apps, including code examples for false-positive-tolerant existence checks.

2025-10-04

A Bloom Filter for web scraping deduplication solves a common problem in crawler systems: fast existence checks across massive datasets. It is especially useful when you can tolerate a small false-positive rate, such as in cache-penetration protection, URL deduplication, and large-scale “seen-before” checks.

To build intuition quickly, start with a classic “minimum mice” thought experiment. Then we will connect that logic to Bloom Filters and finish with practical deduplication implementations in Python and Redis.


What problem does a Bloom Filter solve?

In large-scale scraping, you often need to answer one question quickly: has this URL been crawled before?

A traditional set or database index works, but it consumes more memory and may slow down under heavy concurrency. In contrast, a Bloom Filter uses a compact bit array and hash functions. As a result, it can handle large-scale existence checks with very low memory cost.

This is why a Bloom Filter for web scraping deduplication becomes a popular choice in distributed crawler systems.


The 7 mice problem and the Bloom Filter intuition

Imagine 100 bottles of liquid. One bottle contains poison, and a single drop kills a mouse within 24 hours. You want to identify the poisoned bottle in one day, using the fewest mice.

Instead of using 100 mice, you can solve it with only 7 mice, because 7 bits can encode 2^7 = 128 distinct values, which is more than enough to label all 100 bottles.

Step 1: Encode bottles in binary (7 bits)

Number the bottles 1–100 and write each number as a 7-bit binary code (for example, bottle 45 is 0101101).

Step 2: Feed based on bit positions

Assign Mouse 1–Mouse 7 to the 7 bit positions, with Mouse 1 taking the most significant bit. Then each mouse drinks one drop from every bottle whose binary code has a 1 in that mouse's position.

Step 3: Decode deaths after 24 hours

For example, if Mice 2, 4, 5, and 7 die, the binary result (one bit per mouse, dead = 1) is:

0101101 → decimal 45 → Bottle 45 is poisoned.
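
To make the encoding and decoding concrete, here is a small Python sketch of the same logic; the helper names (mice_for_bottle, bottle_from_dead_mice) are purely illustrative:

def mice_for_bottle(bottle: int, bits: int = 7) -> list[int]:
    # Which mice (1 = most significant bit) should drink from this bottle?
    code = format(bottle, f"0{bits}b")
    return [i + 1 for i, b in enumerate(code) if b == "1"]

def bottle_from_dead_mice(dead: list[int], bits: int = 7) -> int:
    # Decode the poisoned bottle number from the set of mice that died.
    return int("".join("1" if i + 1 in dead else "0" for i in range(bits)), 2)

print(mice_for_bottle(45))                   # [2, 4, 5, 7]
print(bottle_from_dead_mice([2, 4, 5, 7]))   # 45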

This demonstrates the core idea: you can represent a large search space using compact bits. Likewise, a Bloom Filter stores membership hints inside a bit array.


How Bloom Filters work

A Bloom Filter has two main parts:

  1. A bit array (each position is 0 or 1)
  2. Multiple hash functions (or a hash function with multiple seeds)

Add an element

  1. Compute k hash values
  2. Map them into bit array positions
  3. Set those bits to 1

Check existence

  1. Compute the same k hashes
  2. If any bit is 0, the element definitely does not exist
  3. If all bits are 1, the element probably exists

Key conclusion: a Bloom Filter can report false positives, but never false negatives. If it says an element is absent, that element was definitely never added.

That false-positive possibility is the trade-off for speed and memory savings in a Bloom Filter for web scraping deduplication.
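
How large that false-positive rate is depends on the bit-array size m, the number of stored items n, and the number of hash functions k. The snippet below evaluates the standard approximation p ≈ (1 − e^(−kn/m))^k; the helper function is only illustrative:

import math

def false_positive_rate(n_items: int, m_bits: int, k_hashes: int) -> float:
    # Standard approximation for a Bloom Filter's false-positive probability.
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Example: 1 million items in a 10-million-bit array with 7 hash functions
print(false_positive_rate(1_000_000, 10_000_000, 7))  # ~0.008 (under 1%)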


Pros and cons of Bloom Filters

Pros

  1. Very low memory footprint compared to storing every element
  2. Fast, constant-time adds and lookups (k hash computations each)
  3. The original data never needs to be stored

Cons

  1. False positives are possible
  2. A standard Bloom Filter cannot delete elements (clearing a bit could affect other items)
  3. Stored elements cannot be listed or retrieved


Python Bloom Filter implementation (bitarray + mmh3)

Install dependencies:

pip install bitarray mmh3

Implementation:

from bitarray import bitarray
import mmh3

class BloomFilter:
    def __init__(self, size: int, hash_count: int):
        # size: number of bits in the array; hash_count: number of hash seeds (k)
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def add(self, item: str) -> "BloomFilter":
        # Hash the item k times with different seeds and set each resulting bit to 1
        for i in range(self.hash_count):
            index = mmh3.hash(item, i) % self.size
            self.bit_array[index] = 1
        return self

    def __contains__(self, item: str) -> bool:
        # The item "probably exists" only if every corresponding bit is set
        return all(
            self.bit_array[mmh3.hash(item, i) % self.size]
            for i in range(self.hash_count)
        )

# Usage
bloom = BloomFilter(size=100, hash_count=10)
urls = ["www.baidu.com", "www.qq.com", "www.jd.com"]

for url in urls:
    bloom.add(url)

print("www.baidu.com" in bloom)  # True
print("www.fake.com" in bloom)   # False (or rarely True)

Tip: increase size and tune hash_count as you store more URLs; a larger bit array with an appropriate number of hash functions keeps the false-positive rate low.
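
If you prefer to derive size and hash_count from a target false-positive rate, the standard sizing formulas can be applied directly. A minimal sketch, with bloom_parameters as an illustrative helper (not part of bitarray or mmh3):

import math

def bloom_parameters(n_items: int, fp_rate: float) -> tuple[int, int]:
    # m = -n * ln(p) / (ln 2)^2 bits;  k = (m / n) * ln 2 hash functions
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_items) * math.log(2)))
    return m, k

# Example: 1 million URLs at a ~1% false-positive target
m, k = bloom_parameters(1_000_000, 0.01)
print(m, k)  # about 9.6 million bits (~1.2 MB) and 7 hash functions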


Redis integration for distributed deduplication

A single-machine Bloom Filter works well, but distributed crawlers need shared state. A Redis-backed Bloom Filter is therefore a common upgrade path: the RedisBloom module (bundled with Redis Stack) provides BF.* commands that every crawler node can share.

Install client:

pip install redis

Example (RedisBloom commands):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Create the Bloom filter: error_rate=0.01, capacity=100000
# (BF.RESERVE fails if the key already exists, so run it once during setup)
r.execute_command("BF.RESERVE", "myBloomFilter", 0.01, 100000)

# Add and check
items = ["url1", "url2"]
for item in items:
    r.execute_command("BF.ADD", "myBloomFilter", item)

print(r.execute_command("BF.EXISTS", "myBloomFilter", "url1"))  # 1
print(r.execute_command("BF.EXISTS", "myBloomFilter", "urlX"))  # 0 (usually)

In a real distributed workflow, replace localhost with your shared Redis host. Then every crawler node uses the same Bloom Filter state, so deduplication stays consistent.
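
In practice, a crawler node can combine the add and the check into a single round trip, because BF.ADD returns 1 when the item was newly added and 0 when it probably already exists. The host name, filter key, and should_crawl helper below are placeholders for illustration:

import redis

# Shared Redis host for all crawler nodes (placeholder address)
r = redis.Redis(host="redis.shared.internal", port=6379, db=0)

def should_crawl(url: str) -> bool:
    # BF.ADD returns 1 if the URL was newly added, 0 if it probably already exists.
    # If the filter does not exist yet, RedisBloom creates it with default parameters,
    # so reserve it with BF.RESERVE during setup to control error rate and capacity.
    return r.execute_command("BF.ADD", "crawler:seen_urls", url) == 1

for url in ["https://example.com/a", "https://example.com/a"]:
    if should_crawl(url):
        print("crawl:", url)
    else:
        print("skip (probably seen):", url)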


Conclusion

A Bloom Filter for web scraping deduplication provides an efficient trade-off: it dramatically reduces memory usage and keeps existence checks fast, while accepting a small false-positive rate. With Python, you can implement it locally in minutes. With Redis, you can scale the same idea across distributed crawler clusters.